Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech
We describe a statistical approach for modeling dialogue acts in
conversational speech, i.e., speech-act-like units such as Statement, Question,
Backchannel, Agreement, Disagreement, and Apology. Our model detects and
predicts dialogue acts based on lexical, collocational, and prosodic cues, as
well as on the discourse coherence of the dialogue act sequence. The dialogue
model is based on treating the discourse structure of a conversation as a
hidden Markov model and the individual dialogue acts as observations emanating
from the model states. Constraints on the likely sequence of dialogue acts are
modeled via a dialogue act n-gram. The statistical dialogue grammar is combined
with word n-grams, decision trees, and neural networks modeling the
idiosyncratic lexical and prosodic manifestations of each dialogue act. We
develop a probabilistic integration of speech recognition with dialogue
modeling, to improve both speech recognition and dialogue act classification
accuracy. Models are trained and evaluated using a large hand-labeled database
of 1,155 conversations from the Switchboard corpus of spontaneous
human-to-human telephone speech. We achieved good dialogue act labeling
accuracy (65% based on errorful, automatically recognized words and prosody,
and 71% based on word transcripts, compared to a chance baseline accuracy of
35% and human accuracy of 84%) and a small reduction in word recognition error.
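The abstract's core idea, treating the DA sequence as hidden states of an HMM with a DA n-gram as the transition model and per-utterance lexical/prosodic likelihoods as emissions, can be sketched with a toy Viterbi decoder. All tag sets and probabilities below are invented for illustration; the paper's actual models are trained on Switchboard.

```python
import math

# Hidden states are dialogue acts (DAs); observations are per-utterance
# likelihoods P(utterance | DA) from lexical/prosodic models; transitions
# come from a DA bigram "discourse grammar". Numbers are illustrative only.

DAS = ["Statement", "Question", "Backchannel"]

# P(next DA | previous DA): the DA bigram.
TRANS = {
    "Statement":   {"Statement": 0.5, "Question": 0.3, "Backchannel": 0.2},
    "Question":    {"Statement": 0.7, "Question": 0.1, "Backchannel": 0.2},
    "Backchannel": {"Statement": 0.6, "Question": 0.3, "Backchannel": 0.1},
}
INIT = {"Statement": 0.6, "Question": 0.3, "Backchannel": 0.1}

def viterbi(obs_likelihoods):
    """obs_likelihoods: list of dicts mapping DA -> P(utterance | DA)."""
    v = [{da: math.log(INIT[da]) + math.log(obs_likelihoods[0][da])
          for da in DAS}]
    back = []
    for obs in obs_likelihoods[1:]:
        col, ptr = {}, {}
        for da in DAS:
            prev = max(DAS, key=lambda p: v[-1][p] + math.log(TRANS[p][da]))
            col[da] = v[-1][prev] + math.log(TRANS[prev][da]) + math.log(obs[da])
            ptr[da] = prev
        v.append(col)
        back.append(ptr)
    # Trace back the best-scoring DA sequence.
    best = max(DAS, key=lambda da: v[-1][da])
    seq = [best]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

# Three utterances with question-like, statement-like, and
# backchannel-like acoustic/lexical evidence, respectively.
obs = [
    {"Statement": 0.2, "Question": 0.7, "Backchannel": 0.1},
    {"Statement": 0.8, "Question": 0.1, "Backchannel": 0.1},
    {"Statement": 0.1, "Question": 0.1, "Backchannel": 0.8},
]
print(viterbi(obs))  # -> ['Question', 'Statement', 'Backchannel']
```

The discourse grammar matters when the local evidence is ambiguous: a weakly question-like utterance after a question is pulled toward Statement by the strong Question-to-Statement transition.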
Automatic detection of discourse structure for speech recognition and understanding.
We describe a new approach for statistical modeling and detection of discourse structure
for natural conversational speech. Our model is based on 42 "Dialog Acts" (DAs)
(question, answer, backchannel, agreement, disagreement, apology, etc.). We labeled
1155 conversations from the Switchboard (SWBD) database (Godfrey et al. 1992) of
human-to-human telephone conversations with these 42 types and trained a Dialog Act
detector based on three distinct knowledge sources: sequences of words which characterize
a dialog act, prosodic features which characterize a dialog act, and a statistical
Discourse Grammar. Our combined detector, although still in preliminary stages, already
achieves a 65% Dialog Act detection rate based on acoustic waveforms, and 72%
accuracy based on word transcripts. Using this detector to switch among the 42 Dialog-
Act-Specific trigram LMs also gave us an encouraging but not statistically significant
reduction in SWBD word error.
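The switching scheme mentioned at the end, using the DA detector to choose among DA-specific LMs for rescoring, can be sketched as a posterior-weighted mixture of per-DA language models. The posteriors, vocabulary, and probabilities below are invented, and unigram LMs stand in for the paper's trigram LMs.

```python
# Mix DA-specific language models, weighting each by the detector's
# posterior over dialog acts. All numbers are illustrative only.

da_posterior = {"Question": 0.7, "Statement": 0.3}

# Per-DA unigram LMs standing in for DA-specific trigram LMs.
lm = {
    "Question":  {"do": 0.4, "you": 0.4, "yeah": 0.2},
    "Statement": {"do": 0.1, "you": 0.3, "yeah": 0.6},
}

def mixed_prob(word):
    """P(word) = sum over DAs of P(DA | evidence) * P(word | DA)."""
    return sum(da_posterior[da] * lm[da].get(word, 0.0) for da in da_posterior)

print(round(mixed_prob("do"), 3))    # 0.7*0.4 + 0.3*0.1 = 0.31
print(round(mixed_prob("yeah"), 3))  # 0.7*0.2 + 0.3*0.6 = 0.32
```

Hard switching is the special case where the top DA gets posterior 1.0; soft mixing degrades more gracefully when the DA detector is uncertain.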
Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?
Identifying whether an utterance is a statement, question, greeting, and so forth is integral to effective automatic understanding of natural dialog. Little is known, however, about how such dialog acts (DAs) can be automatically classified in truly natural conversation. This study asks whether current approaches, which use mainly word information, could be improved by adding prosodic information. The study examines over 1000 conversations from the Switchboard corpus. DAs were hand-annotated, and prosodic features (duration, pause, F0, energy, and speaking-rate features) were automatically extracted for each DA. In training, decision trees based on these features were inferred; trees were then applied to unseen test data to evaluate performance. For an all-way classification as well as three subtasks, prosody allowed highly significant classification over chance. Feature-specific analyses further revealed that although canonical features (such as F0 for questions) were important, less obvious features could compensate if canonical features were removed. Finally, in each task, integrating the prosodic model with a DA-specific statistical language model improved performance over that of the language model alone. Results suggest that DAs are redundantly marked in natural conversation, and that a variety of automatically extractable prosodic features could aid dialog processing in speech applications.
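The study's decision-tree approach can be illustrated with a minimal entropy-based split on a single prosodic feature. The feature (final F0 slope, rising for questions), values, and labels below are invented; the actual trees use many features and much larger data.

```python
import math

# Toy induction of a one-node decision tree ("stump") over a prosodic
# feature: pick the threshold minimizing weighted child entropy.
# Feature values and labels are invented for illustration.

train = [  # (final_f0_slope, DA label); values must be distinct here
    (0.9, "Question"), (0.7, "Question"), (0.5, "Question"),
    (-0.2, "Statement"), (-0.6, "Statement"), (0.1, "Statement"),
]

def entropy(labels):
    n = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_threshold(data):
    """Try midpoints between consecutive sorted feature values."""
    xs = sorted(x for x, _ in data)
    best, best_score = None, float("inf")
    for t in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        left = [l for x, l in data if x <= t]
        right = [l for x, l in data if x > t]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(data)
        if score < best_score:
            best, best_score = t, score
    return best

t = best_threshold(train)

def classify(f0_slope):
    return "Question" if f0_slope > t else "Statement"

print(t, classify(0.8), classify(-0.4))
```

On this toy data the learned threshold (0.3) perfectly separates rising from flat-or-falling contours; the paper's "less obvious features compensate" finding corresponds to the tree falling back to other features when F0 is removed.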
Generating Event Descriptions with SAGE: a Simulation . . .
The SAGE system (Simulation and Generation Environment) was developed to address issues at the interface between conceptual modelling and natural language generation. In this paper, I describe SAGE and its components in the context of event descriptions. I show how kinds of information, such as the Reichenbachian temporal points and event structure, which are usually treated as unified systems, are often best represented at multiple levels in the overall system. SAGE is composed of a knowledge representation language and simulator, which form the underlying model and constitute the "speaker"; a graphics component which displays the actions of the simulator and provides an anchor for locative and deictic relations; and the generator SPOKESMAN, which produces a textual narration of events.
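The three-component architecture described above, simulator as speaker, graphics as deictic anchor, generator as narrator, can be caricatured as a pipeline. All names, event tuples, and templates below are invented stand-ins, not the actual SAGE/SPOKESMAN code.

```python
# Illustrative pipeline mirroring SAGE's division of labor:
# simulator events flow to a display (deictic anchor) and a narrator.
# Event representation and templates are invented for illustration.

events = [("move", "robot", "kitchen"), ("grasp", "robot", "cup")]

def display(event):
    """Stand-in for the graphics component: track what is on screen."""
    return {"on_screen": event[1:]}

def narrate(event):
    """Stand-in for the generator: one clause per simulator event."""
    action, agent, obj = event
    if action == "grasp":
        return f"The {agent} {action}s the {obj}."
    return f"The {agent} {action}s to the {obj}."

for e in events:
    display(e)
    print(narrate(e))
```

The point of the real system, lost in this caricature, is that temporal and event-structural information is distributed across these levels rather than held in one unified representation.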
The generation gap: The problem of expressibility in text planning
This thesis identifies and provides a solution for a particular problem in natural language generation: the problem of ensuring the expressibility of a text plan. Natural language generation is the process of going from a representation of a situation to a textual expression of some relevant portion of that situation in a natural language. Generation systems must have a principled way of ensuring that the message composed by the text planner is expressible in language, that is, that there are linguistic resources (words, syntactic structures) available for the linguistic component to realize the elements of the plan, and their composition is in accordance with the rules of composition in the language. I have addressed the problem of expressibility by designing a level of representation, the Text Structure, which is used by the text planner in composing the utterance. This intermediate level of representation bridges the generation gap between the representation of the world in the application program and the linguistic resources provided by the language. The terms and expressions in the Text Structure are abstractions over the concrete resources of language (the words, morphological markings, syntactic structures, etc. that actually appear in a stream of text). These abstract linguistic resources group together the expressible combinations of concrete linguistic resources. I have identified three kinds of information that are essential to an abstract linguistic representation: the constituency, the semantic category of the constituent (e.g. event, property), and the structural relations among the constituents (e.g. argument, adjunct). By providing the planner with a set of abstract resources, rather than letting it choose from the individual features that make them up, it is prevented from choosing a set of features that is not realizable. 
These abstractions can further constrain composition by defining what kinds of constituents can be extended and how semantic categories can compose. Text Structure is implemented in the Spokesman Generation System, which produces text for a variety of application programs. I describe in detail the structures in Spokesman's text planner and walk through an example of the generation of a biographical paragraph from the Main Street simulation program.
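The thesis's three kinds of information, constituency, semantic category, and structural relation, suggest a simple data structure in which composition is checked against what the language can express. The category names and the composition table below are simplified inventions, not the thesis's actual inventory.

```python
from dataclasses import dataclass, field

# Which semantic categories may fill which structural relations:
# a stand-in for Text Structure's composition constraints, so the
# planner can only build expressible plans. Table is invented.
ALLOWED = {
    "argument": {"object", "event"},
    "adjunct":  {"property", "event"},
}

@dataclass
class Constituent:
    head: str
    category: str          # semantic category: "event", "property", "object"
    children: list = field(default_factory=list)  # (relation, Constituent)

    def attach(self, relation, child):
        """Refuse compositions the language cannot express."""
        if child.category not in ALLOWED[relation]:
            raise ValueError(f"{child.category} cannot fill {relation}")
        self.children.append((relation, child))
        return self

plan = Constituent("walk", "event")
plan.attach("argument", Constituent("Karen", "object"))
plan.attach("adjunct", Constituent("slow", "property"))
print([(rel, c.head) for rel, c in plan.children])
```

The key design point mirrors the thesis: the planner chooses among whole abstract resources (constituent plus category plus relation) rather than free feature combinations, so an inexpressible plan is rejected at composition time instead of failing in the linguistic realizer.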